Fitting an unconditional model (without predictors) Robust estimates – Gaussian vs. t-distribution
How do we fit an unconditional model (without predictors)? From here, we can see the philosophy of the Bayesian perspective.
Scale y | Dist. \(y \sim f(\mu, ...)\) | Inverse link \(\mu = g^{-1}(\eta)\) |
---|---|---|
Metric | \(y \sim dnorm(\mu,\sigma)\) | \(\mu = \eta\) |
Binary | \(y \sim dbern(\mu)\) | \(\mu = logistic(\eta)\) |
Nominal | \(y \sim dmultinom(\mu_1, ...,\mu_p)\) | \(\mu_k = \frac{exp(\eta_k)}{\sum_j exp(\eta_j)}\) |
Ordinal | \(y \sim dmultinom(\mu_1, ...,\mu_p)\) | \(\mu = \Phi(\frac{\theta_k - \eta}{\sigma}) - \Phi(\frac{\theta_{k-1} - \eta}{\sigma})\) |
Count | \(y \sim dpois(\mu)\) | \(\mu = exp(\eta)\) |
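For intuition, these inverse links are easy to write directly in R; the helper names below are purely illustrative (not from any package):

# inverse links from the table above (illustrative helper functions)
inv_identity <- function(eta) eta                     # metric scale
inv_logistic <- function(eta) 1/(1 + exp(-eta))       # binary scale
inv_softmax  <- function(eta) exp(eta)/sum(exp(eta))  # nominal scale
inv_log      <- function(eta) exp(eta)                # count scale
inv_logistic(0)          # 0.5
inv_softmax(c(1, 2, 3))  # probabilities summing to 1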
We just need to estimate both parameters of the normal distribution for the random variable \(y\).
# preparing data
set.seed(03172021)
y <- rnorm(100, mean = 6, sd = 3)
Ntotal = length(y)
Create a data list and include the sample mean and standard deviation in it.
dataList = list(
  y = y,
  Ntotal = Ntotal,
  mean_mu = mean(y),
  sd_mu = sd(y)
)
Specify the model.
Recall that in JAGS [@RN733] the normal distribution is specified by the precision \(\frac{1}{\sigma^2}\) instead of the standard deviation or variance. Select an uninformative prior for \(\sigma\) and a normal (conjugate) prior for \(\mu\).
What do we suggest as parameters of the normal distribution based on the sample? \[ y_i \sim N(\mu,\tau), \ \text{where} \ \tau = \frac{1}{\sigma^2} \\ \mu \sim N(M,S) \\ \sigma \sim \text{Uniform}[Lo, Hi] \ \text{or} \ \text{Gamma}(\alpha, \beta) \]
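To see why the normal prior on \(\mu\) is called conjugate, note that with \(\sigma\) treated as known the posterior of \(\mu\) is again normal and available in closed form; a minimal sketch with made-up numbers:

# conjugate normal update for mu when sigma is treated as known (illustrative numbers)
M <- 6; S <- 3             # prior mean and sd of mu (assumed)
sigmaKnown <- 3; n <- 100  # "known" sd and sample size (assumed)
ybar <- 5.92               # sample mean (assumed)
postPrec <- 1/S^2 + n/sigmaKnown^2                    # precisions add
postMean <- (M/S^2 + n*ybar/sigmaKnown^2) / postPrec  # precision-weighted average
c(mean = postMean, sd = sqrt(1/postPrec))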
modelString = "
model {
  for (i in 1:Ntotal) {
    y[i] ~ dnorm(mu, 1/sigma^2)  # JAGS uses precision
  }
  # mu ~ dnorm(mean_mu, 1/(100*sd_mu)^2)
  mu ~ dnorm(mean_mu, Ntotal/sd_mu^2)  # JAGS uses precision
  sigma ~ dunif(sd_mu/1000, sd_mu*1000)
}
" # close quote for modelString
# Write out modelString to a text file
writeLines(modelString, con="data/TEMPmodel.txt")
Initialize chains.
initsList <- function(){
  upDown <- sample(c(1,-1), 1)
  m <- mean(y)*(1 + upDown*.05)
  s <- sd(y)*(1 - upDown*.1)
  list(mu = m, sigma = s)
}
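Calling the function a few times shows how each chain receives slightly different, jittered starting values around the sample estimates:

# each call randomly shifts the starting values up or down by a few percent
replicate(3, initsList(), simplify = FALSE)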
Run the chains.
# Create, initialize, and adapt the model:
parameters = c("mu", "sigma")  # The parameters to be monitored
adaptSteps = 500               # Number of steps to "tune" the samplers
burnInSteps = 1000
numSavedSteps = 50000
nChains = 4
thinSteps = 1
nIter = ceiling((numSavedSteps*thinSteps)/nChains)
jagsModel = jags.model("data/TEMPmodel.txt", data=dataList, inits=initsList,
                       n.chains=nChains, n.adapt=adaptSteps)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 100
Unobserved stochastic nodes: 2
Total graph size: 114
Initializing model
# Burn-in:
update(jagsModel, n.iter=burnInSteps)
# Run it
# The saved MCMC chain:
codaSamples = coda.samples(jagsModel, variable.names=parameters,
                           n.iter=nIter, thin=thinSteps)
Check how the chains converged.
summary(codaSamples)
Iterations = 1501:14000
Thinning interval = 1
Number of chains = 4
Sample size per chain = 12500
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 5.923 0.2265 0.001013 0.001001
sigma 3.208 0.2325 0.001040 0.001407
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 5.480 5.769 5.922 6.076 6.366
sigma 2.792 3.044 3.194 3.356 3.698
The parameter estimates are close to the values we simulated and very similar to what point estimation would give.
mean(y)
[1] 5.922201
sd(y)
[1] 3.173288
Plot the sample traces and the posterior densities of the parameters.
plot(codaSamples)
Plot autocorrelations.
autocorr.plot(codaSamples, ask = F)
The autocorrelation function plot suggests that the effective sample size for the standard deviation must be noticeably smaller than for the mean.
In fact:
effectiveSize(codaSamples)
mu sigma
51236.85 27414.72
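The effective size discounts the raw number of draws by the chain autocorrelation, roughly \(N/(1 + 2\sum_k \rho_k)\); a crude manual check for \(\sigma\) in the first chain (truncating at the positive autocorrelations) is sketched below:

# rough effective size for sigma, chain 1: N / (1 + 2 * sum of positive autocorrelations)
sigma1 <- as.matrix(codaSamples[[1]])[, "sigma"]
rho <- acf(sigma1, lag.max = 50, plot = FALSE)$acf[-1]  # drop lag 0
length(sigma1) / (1 + 2*sum(rho[rho > 0]))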
The shrink factor shows that, despite the longer memory of the standard deviation chain, the distributions converged:
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1
sigma 1 1
Multivariate psrf
1
gelman.plot(codaSamples)
Observed HDIs of the chains:
lapply(codaSamples,function(z) hdi(as.matrix(z)))
[[1]]
mu sigma
lower 5.479068 2.774907
upper 6.362111 3.677393
attr(,"credMass")
[1] 0.95
[[2]]
mu sigma
lower 5.488275 2.766543
upper 6.367365 3.661088
attr(,"credMass")
[1] 0.95
[[3]]
mu sigma
lower 5.475716 2.778049
upper 6.353808 3.677524
attr(,"credMass")
[1] 0.95
[[4]]
mu sigma
lower 5.489742 2.782062
upper 6.386876 3.677348
attr(,"credMass")
[1] 0.95
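The HDI is simply the shortest interval containing 95% of the posterior draws; a minimal sketch of that computation for \(\mu\) in the first chain (essentially what the `hdi()` call above does):

# shortest interval containing 95% of the draws for mu, chain 1 (sketch)
mu1 <- sort(as.matrix(codaSamples[[1]])[, "mu"])
n95 <- ceiling(0.95 * length(mu1))             # number of draws inside the interval
starts <- 1:(length(mu1) - n95 + 1)            # candidate left endpoints
widths <- mu1[starts + n95 - 1] - mu1[starts]  # width of each candidate interval
i <- which.min(widths)                         # narrowest one
c(lower = mu1[i], upper = mu1[i + n95 - 1])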
Now specify the equivalent model in Stan.
modelString = "
data {
int<lower=1> Ntotal;
real y[Ntotal];
real mean_mu;
real sd_mu;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
unifLo = sd_mu/100;
unifHi = sd_mu*100;
normalSigma = sd_mu*100; // vague prior: 100 times the sample sd
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
sigma ~ uniform(unifLo, unifHi);
mu ~ normal(mean_mu, normalSigma);
y ~ normal(mu, sigma);
}
" # close quote for modelString
Create a DSO and save it to disk to reuse later or just keep it in memory.
stanDso <- stan_model(model_code = modelString)
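To reuse the compiled model across R sessions, one option is base R serialization; the file name below is just an example:

# optionally persist the compiled model for later sessions (file name is illustrative)
saveRDS(stanDso, file = "data/stanDso_normal.rds")
# later: stanDso <- readRDS("data/stanDso_normal.rds")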
Run chains by either using the existing DSO:
stanFit <- sampling(object = stanDso,
                    data = dataList,
                    pars = c('mu', 'sigma'),
                    chains = 2,
                    cores = 2,
                    iter = 5000,
                    warmup = 200,
                    thin = 1)
Or, alternatively, use the description of the model saved in `ch16_1.stan` directly:
# fit model
fit <- stan(file = "modelString.stan",
            data = list(Ntotal = length(y),
                        y = y,
                        meanY = mean(y),
                        sdY = sd(y)),
            pars = c('mu', 'sigma'),
            # control = list(adapt_delta=0.99),
            iter = 5000, chains = 2, cores = 2)
Objects `fit` and `stanFit` should return very similar results. The difference between `stan()` and `sampling()` is the argument `object`, which takes the compiled DSO. If you expect to repeat the same calculations with different data, compiling the DSO once and reusing it with `sampling()` is faster.
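For example, the same DSO can be sampled again with a different data set without recompiling; the second sample below is made up purely for illustration:

# reuse the compiled DSO on a new data set -- no recompilation needed (illustration)
yNew <- rnorm(200, mean = 2, sd = 1)
stanFitNew <- sampling(object = stanDso,
                       data = list(y = yNew, Ntotal = length(yNew),
                                   mean_mu = mean(yNew), sd_mu = sd(yNew)),
                       pars = c('mu', 'sigma'),
                       chains = 2, cores = 2, iter = 5000, warmup = 200, thin = 1)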
# text statistics:
print (stanFit)
Inference for Stan model: anon_model.
2 chains, each with iter=5000; warmup=200; thin=1;
post-warmup draws per chain=4800, total post-warmup draws=9600.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
mu 5.92 0.00 0.32 5.30 5.71 5.92 6.14 6.57
sigma 3.22 0.00 0.23 2.80 3.06 3.20 3.36 3.72
lp__ -164.84 0.02 1.01 -167.54 -165.23 -164.53 -164.11 -163.85
n_eff Rhat
mu 8953 1
sigma 7630 1
lp__ 4176 1
Samples were drawn using NUTS(diag_e) at Thu Feb 23 20:07:56 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
# estimates & hdi:
plot(stanFit)
# samples
traceplot(stanFit, ncol=1, inc_warmup=F)
pairs(stanFit, pars=c('mu','sigma'))
stan_scat(stanFit, c('mu', 'sigma'))
stan_hist(stanFit)
stan_dens(stanFit)
# autocorrelation:
stan_ac(stanFit, separate_chains = T)
# or work with the familiar coda class:
stan2coda <- function(fit) {
  # apply to all chains
  mcmc.list(lapply(1:ncol(fit), function(x) mcmc(as.array(fit)[,x,])))
}
codaSamples <- stan2coda(stanFit)
summary(codaSamples)
Iterations = 1:4800
Thinning interval = 1
Number of chains = 2
Sample size per chain = 4800
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 5.923 0.3225 0.003291 0.003330
sigma 3.216 0.2344 0.002392 0.002609
lp__ -164.839 1.0128 0.010337 0.015893
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 5.298 5.707 5.922 6.137 6.566
sigma 2.795 3.055 3.201 3.364 3.720
lp__ -167.540 -165.228 -164.526 -164.113 -163.847
plot(codaSamples)
autocorr.plot(codaSamples)
effectiveSize(codaSamples)
mu sigma lp__
9392.909 8296.156 4092.009
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1.00
sigma 1 1.00
lp__ 1 1.01
Multivariate psrf
1
gelman.plot(codaSamples)
plot(density(codaSamples[[1]][,1]),xlim=c(0,8),ylim=c(0,3)) # mu, 1st chain
lines(density(codaSamples[[1]][,2])) # sigma, 1st chain
lines(density(codaSamples[[2]][,1]),col="red") # mu, 2nd chain
lines(density(codaSamples[[2]][,2]),col="red") # sigma, 2nd chain
Or you can use shinystan to do a similar analysis of the fitted model:
launch_shinystan(fit)
Create a sample with heavy tails in order to check robust estimation of the parameters of the normal distribution.
Refer to this simulation of leptokurtic distributions.
nSample <- 1000
sd.Values <- c(2, 3.4, .8, 2.6)
sd.process <- rep(c(rep(sd.Values[1], 50),
                    rep(sd.Values[2], 75),
                    rep(sd.Values[3], 75),
                    rep(sd.Values[4], 50)), 4)
plot(sd.process,type="l")
The variable `sd.process` is a deterministically changing standard deviation. Simulating perfectly normal and independent realizations with such different standard deviations produces a leptokurtic distribution.
set.seed(05062022)
y <- rnorm(nSample)*sd.process
y <- y[1:300]
plot(y, type="l")
den <- density(y)
plot(den)
lines(den$x, dnorm(den$x, mean(y), sd(y)), col="red")
Ntotal = length(y)
The density plot clearly shows fat tails; under the assumption of a normal distribution these observations would most likely be flagged as outliers.
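A quick numeric check of the fat tails is the sample excess kurtosis, which is 0 for a Gaussian and clearly positive here; a sketch without extra packages:

# sample excess kurtosis: 0 for a Gaussian, > 0 for fat tails (rough check)
z <- (y - mean(y))/sd(y)
mean(z^4) - 3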
Create data list.
dataList = list(
  y = y,
  Ntotal = Ntotal,
  mean_mu = mean(y),
  sd_mu = sd(y)
)
The Student \(t\)-distribution with \(\nu\) degrees of freedom appears in a sampling scheme from a Gaussian distribution with unknown mean and variance.
In such a case the variable \(X \sim N(\mu, \sigma)\) has the standardized score \[z = \frac{X - \hat{\mu}}{s/\sqrt{n}} = \frac{X - \hat{\mu}}{\hat{\sigma}},\] where \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \hat{\mu})^2\), and this score has a \(t\)-distribution with \(n-1\) degrees of freedom.
Then the random variable \[ X = \hat{\mu} + \hat{\sigma}z\] has a generalized Student distribution with location parameter \(\hat{\mu}\), scale parameter \(\hat{\sigma}\), and \(\nu\) degrees of freedom.
The parameters \(\hat{\mu}\) and \(\hat{\sigma}\) do not correspond exactly to the parameters of the Gaussian distribution, but \(\hat{\mu} \rightarrow \mu\) and \(\hat{\sigma} \rightarrow \sigma\) as \(\nu\) becomes large and the \(t\)-distribution approaches the Gaussian.
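To see what the extra parameter buys us, compare the standard normal density with a \(t\)-density with few degrees of freedom; the heavier tails are what make the estimate robust (\(\nu = 3\) below is chosen only for illustration):

# compare N(0,1) with a t-distribution with few degrees of freedom (nu = 3 is illustrative)
x <- seq(-6, 6, length.out = 400)
plot(x, dnorm(x), type = "l", ylab = "density")
lines(x, dt(x, df = 3), col = "red")
legend("topright", legend = c("N(0,1)", "t, nu = 3"), col = c("black", "red"), lty = 1)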
With this interpretation of the parameters, specify the JAGS model with a \(t\)-distribution likelihood:
modelString = "
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dt(mu, 1/sigma^2, nu)
  }
  mu ~ dnorm( mean_mu , 100/sd_mu^2 )
  sigma ~ dunif( sd_mu/1000 , sd_mu*1000 )
  nu ~ dexp(1/30.0)
}
" # close quote for modelString
# Write out modelString to a text file
writeLines(modelString , con="data/TEMPmodel-jags.txt")
Initialize the model with MLE.
initsList <- function() {
  upDown <- sample(c(1,-1), 1)
  m <- mean(y)*(1 + upDown*.05)
  s <- sd(y)*(1 - upDown*.1)
  list(mu = m, sigma = s, nu = 2)
}
parameters = c("mu", "sigma", "nu")  # The parameters to be monitored
adaptSteps = 500                     # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 3
thinSteps = 1
numSavedSteps = 50000
(nIter = ceiling((numSavedSteps*thinSteps)/nChains))
[1] 16667
# Create, initialize, and adapt the model:
jagsModel = jags.model("data/TEMPmodel-jags.txt", data=dataList, inits=initsList,
                       n.chains=nChains, n.adapt=adaptSteps)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 300
Unobserved stochastic nodes: 3
Total graph size: 318
Initializing model
Run the chains.
# Burn-in:
update(jagsModel, n.iter=burnInSteps)
# The saved MCMC chain:
codaSamples = coda.samples(jagsModel, variable.names=parameters,
                           n.iter=nIter, thin=thinSteps)
Explore the results.
summary(codaSamples)
Iterations = 1501:18167
Thinning interval = 1
Number of chains = 3
Sample size per chain = 16667
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu -0.2816 0.1115 0.0004988 0.0006814
nu 4.1298 1.5167 0.0067826 0.0223954
sigma 1.8116 0.1673 0.0007482 0.0021333
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu -0.5031 -0.3565 -0.2804 -0.2055 -0.06739
nu 2.3196 3.1388 3.7905 4.7075 7.98449
sigma 1.5018 1.6962 1.8053 1.9221 2.15579
mean(y)
[1] -0.4222642
sd(y)
[1] 2.495025
Note that the robust estimate of \(\mu\) is similar to the sample mean, but the robust estimate of \(\sigma\) is significantly smaller than the sample standard deviation.
plot(codaSamples)
summary(sd.process)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.80 0.80 2.30 2.18 3.40 3.40
autocorr.plot(codaSamples,ask=F)
effectiveSize(codaSamples)
mu nu sigma
26967.896 4580.058 6146.000
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1.00
nu 1 1.01
sigma 1 1.00
Multivariate psrf
1
gelman.plot(codaSamples)
head(codaSamples[1])
[[1]]
Markov Chain Monte Carlo (MCMC) output:
Start = 1501
End = 1507
Thinning interval = 1
mu nu sigma
[1,] -0.35538534 4.495681 1.958775
[2,] -0.06611877 4.258681 1.793895
[3,] -0.05521260 4.103952 1.789491
[4,] -0.24838098 5.631748 1.801702
[5,] -0.24537268 4.109628 1.863111
[6,] -0.19055382 3.354896 2.028280
[7,] -0.56653077 4.627866 1.839904
attr(,"class")
[1] "mcmc.list"
(HDIofChains <- lapply(codaSamples, function(z) hdi(as.matrix(z))))
[[1]]
mu nu sigma
lower -0.5047759 2.070184 1.501770
upper -0.0678133 7.153532 2.149501
attr(,"credMass")
[1] 0.95
[[2]]
mu nu sigma
lower -0.49556250 2.028008 1.475841
upper -0.07011458 6.820492 2.127922
attr(,"credMass")
[1] 0.95
[[3]]
mu nu sigma
lower -0.50150501 1.966809 1.497709
upper -0.06232077 6.961558 2.143415
attr(,"credMass")
[1] 0.95
The non-robust estimate of \(\sigma\) is outside the HDI for all of the chains.
Adapt the Stan model description to the \(t\)-distribution with the additional parameter \(\nu\).
modelString = "
data {
int<lower=1> Ntotal;
real y[Ntotal];
real mean_mu;
real sd_mu;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
real expLambda; //New: parameter of prior for nu
unifLo = sd_mu/1000;
unifHi = sd_mu*1000;
normalSigma = sd_mu*100;
expLambda=1/29.0; //New: setting value for expLambda
}
parameters {
real<lower=0> nuMinusOne; //New: definition of additional parameter nu
real mu;
real<lower=0> sigma;
}
transformed parameters {
real<lower=0> nu; //New: new parameter nu
nu=nuMinusOne+1; //New: shifting nu to avoid zero
}
model {
sigma ~ uniform(unifLo, unifHi);
mu ~ normal(mean_mu, normalSigma);
nuMinusOne~exponential(expLambda); //New: exponential prior for nu
y ~ student_t(nu, mu, sigma); //New: student_t distribution for nu
}
" # close quote for modelString
Create DSO.
stanDso <- stan_model(model_code = modelString)
Run MCMC.
stanFitRobust <- sampling(object = stanDso,
                          data = dataList,
                          pars = c('nu', 'mu', 'sigma'),
                          chains = 3,
                          cores = 3,
                          iter = 50000,
                          warmup = 300,
                          thin = 1)
Explore the results.
# text statistics:
print(stanFitRobust)
Inference for Stan model: anon_model.
3 chains, each with iter=50000; warmup=300; thin=1;
post-warmup draws per chain=49700, total post-warmup draws=149100.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
nu 4.04 0.01 1.54 2.30 3.10 3.72 4.58 7.69
mu -0.24 0.00 0.12 -0.49 -0.33 -0.24 -0.16 0.00
sigma 1.80 0.00 0.16 1.50 1.69 1.80 1.91 2.14
lp__ -514.38 0.01 1.26 -517.67 -514.95 -514.05 -513.46 -512.95
n_eff Rhat
nu 45883 1
mu 87613 1
sigma 62456 1
lp__ 54355 1
Samples were drawn using NUTS(diag_e) at Thu Feb 23 20:11:46 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
# estimates & hdi:
plot(stanFitRobust)
# samples
class(stanFitRobust)
[1] "stanfit"
attr(,"package")
[1] "rstan"
rstan::traceplot(stanFitRobust, ncol=1, inc_warmup=F)
pairs(stanFitRobust, pars=c('nu','mu','sigma'))
stan_scat(stanFitRobust, c('nu','mu'))
stan_scat(stanFitRobust, c('nu','sigma'))
stan_scat(stanFitRobust, c('mu','sigma'))
stan_hist(stanFitRobust)
stan_dens(stanFitRobust)
# autocorrelation:
stan_ac(stanFitRobust, separate_chains = T)
stan_diag(stanFitRobust,information = "sample")
stan_diag(stanFitRobust,information = "stepsize",chain = 0)
stan_diag(stanFitRobust,information = "treedepth",chain = 0)
stan_diag(stanFitRobust,information = "divergence",chain = 0)
There seems to be a pattern in the relationship between \(\sigma\) and \(\nu\).
Check whether there is a dependency.
stanRobustChains <- extract(stanFitRobust)
names(stanRobustChains)
[1] "nu" "mu" "sigma" "lp__"
plot(stanRobustChains$nu,stanRobustChains$sigma)
plot(rank(stanRobustChains$nu),rank(stanRobustChains$sigma))
Interpret the dependency: the degrees of freedom \(\nu\) are related to the sample size \((n-1)\). As the degrees of freedom increase (i.e., the implied sample size grows), the \(t\)-distribution has thinner tails, pushing the critical values towards the mean.
\(\Rightarrow\) this suggests a \(t\)-copula (a guess based on the empirical copula).
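One simple way to quantify the monotone dependence suggested by the rank plot is Spearman's rank correlation between the posterior draws of \(\nu\) and \(\sigma\):

# Spearman (rank) correlation between nu and sigma draws
cor(stanRobustChains$nu, stanRobustChains$sigma, method = "spearman")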
Explore shiny object.
launch_shinystan(stanFitRobust)